Db2 Statistical Functions

Db2 already has a variety of Statistical functions built in. In Db2 11.1, a number of new functions have been added including:

  • COVARIANCE_SAMP - The COVARIANCE_SAMP function returns the sample covariance of a set of number pairs
  • STDDEV_SAMP - The STDDEV_SAMP column function returns the sample standard deviation (division by [n-1]) of a set of numbers.
  • VARIANCE_SAMP or VAR_SAMP - The VARIANCE_SAMP column function returns the sample variance (division by [n-1]) of a set of numbers.
  • CUME_DIST - The CUME_DIST column function returns the cumulative distribution of a row that is hypothetically inserted into a group of rows
  • PERCENT_RANK - The PERCENT_RANK column function returns the relative percentile rank of a row that is hypothetically inserted into a group of rows.
  • PERCENTILE_DISC, PERCENTILE_CONT - Returns the value that corresponds to the specified percentile given a sort specification by using discrete (DISC) or continuous (CONT) distribution
  • MEDIAN - The MEDIAN column function returns the median value in a set of values
  • WIDTH_BUCKET - The WIDTH_BUCKET function is used to create equal-width histograms

Sampling Functions

The traditional VARIANCE, COVARIANCE, and STDDEV functions have been available in Db2 for a long time. When computing these values, the formulae assume that the entire population has been counted (N). The traditional formula for standard deviation is:

$$\sigma=\sqrt{\frac{1}{N}\sum_{i=1}^N(x_{i}-\mu)^{2}}$$

N refers to the size of the population and in many cases, we only have a sample, not the entire population of values. In this case, the formula needs to be adjusted to account for the sampling.

$$s=\sqrt{\frac{1}{N-1}\sum_{i=1}^N(x_{i}-\bar{x})^{2}}$$

Set up the connection to the database.


In [ ]:
%run db2.ipynb

We populate the database with the EMPLOYEE and DEPARTMENT tables so that we can run the various examples.


In [ ]:
%sql -sampledata

COVARIANCE_SAMP

The COVARIANCE_SAMP function returns the sample covariance of a set of number pairs.


In [ ]:
%%sql
SELECT COVARIANCE_SAMP(SALARY, BONUS) 
  FROM EMPLOYEE 
WHERE WORKDEPT = 'A00'

STDDEV_SAMP

The STDDEV_SAMP column function returns the sample standard deviation (division by [n-1]) of a set of numbers.


In [ ]:
%%sql
SELECT STDDEV_SAMP(SALARY) 
  FROM EMPLOYEE 
WHERE WORKDEPT = 'A00'

VARIANCE_SAMP

The VARIANCE_SAMP column function returns the sample variance (division by [n-1]) of a set of numbers.


In [ ]:
%%sql
SELECT VARIANCE_SAMP(SALARY) 
  FROM EMPLOYEE 
WHERE WORKDEPT = 'A00'

MEDIAN

The MEDIAN column function returns the median value in a set of values.


In [ ]:
%%sql
SELECT MEDIAN(SALARY) AS MEDIAN, AVG(SALARY) AS AVERAGE 
  FROM EMPLOYEE 
WHERE WORKDEPT = 'E21'

CUME_DIST

The CUME_DIST column function returns the cumulative distribution of a row that is hypothetically inserted into a group of rows.


In [ ]:
%%sql
SELECT CUME_DIST(47000) WITHIN GROUP (ORDER BY SALARY) 
  FROM EMPLOYEE 
WHERE WORKDEPT = 'A00'

PERCENT_RANK

The PERCENT_RANK column function returns the relative percentile rank of a row that is hypothetically inserted into a group of rows.


In [ ]:
%%sql
SELECT PERCENT_RANK(47000) WITHIN GROUP (ORDER BY SALARY) 
  FROM EMPLOYEE 
WHERE WORKDEPT = 'A00'

PERCENTILE_DISC

The PERCENTILE_DISC/CONT returns the value that corresponds to the specified percentile given a sort specification by using discrete (DISC) or continuous (CONT) distribution.


In [ ]:
%%sql
SELECT PERCENTILE_DISC(0.75) WITHIN GROUP (ORDER BY SALARY) 
  FROM EMPLOYEE 
WHERE WORKDEPT = 'E21'

PERCENTILE_CONT

This is a function that gives you a continuous percentile calculation.


In [ ]:
%%sql
SELECT PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY SALARY) 
  FROM EMPLOYEE 
WHERE WORKDEPT = 'E21'

WIDTH BUCKET and Histogram Example

The WIDTH_BUCKET function is used to create equal-width histograms. Using the EMPLOYEE table, This SQL will assign a bucket to each employee's salary using a range of 35000 to 100000 divided into 13 buckets.


In [ ]:
%%sql
SELECT EMPNO, SALARY, WIDTH_BUCKET(SALARY, 35000, 100000, 13) 
  FROM EMPLOYEE 
ORDER BY EMPNO

We can plot this information by adding some more details to the bucket output.


In [ ]:
%%sql -a
WITH BUCKETS(EMPNO, SALARY, BNO) AS 
  ( 
  SELECT EMPNO, SALARY, 
    WIDTH_BUCKET(SALARY, 35000, 100000, 9) AS BUCKET 
  FROM EMPLOYEE ORDER BY EMPNO 
  ) 
SELECT BNO, COUNT(*) AS COUNT FROM BUCKETS 
GROUP BY BNO 
ORDER BY BNO ASC

And here is a plot of the data to make sense of the histogram.


In [ ]:
%%sql -pb
WITH BUCKETS(EMPNO, SALARY, BNO) AS 
  ( 
  SELECT EMPNO, SALARY, 
    WIDTH_BUCKET(SALARY, 35000, 100000, 9) AS BUCKET 
  FROM EMPLOYEE ORDER BY EMPNO 
  ) 
SELECT BNO, COUNT(*) AS COUNT FROM BUCKETS 
GROUP BY BNO 
ORDER BY BNO ASC

Credits: IBM 2018, George Baklarz [baklarz@ca.ibm.com]